Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to perform a task via in-context learning is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: $\sim$70% of attention heads and $\sim$20% of feed-forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and across numbers of in-context examples. We also probe this hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely prefix matching and copying. These induction heads overlap with the task-specific important heads, suggesting that induction heads are among the heads capable of more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights indicating that large language models may be under-trained for in-context learning, and opens up questions on how to pre-train language models to more effectively perform it.
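The prefix-matching operation mentioned above can be made concrete with a small sketch: given one head's attention matrix over a sequence containing a repeated token block, score how much attention each position places on the token that followed the previous occurrence of the same token. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def prefix_matching_score(tokens: np.ndarray, attn: np.ndarray) -> float:
    """Average attention mass a head places on the token *after* the
    previous occurrence of the current token (the induction target).

    tokens: (T,) int array of token ids (e.g. a random block repeated twice)
    attn:   (T, T) attention matrix for one head (rows = query positions)
    """
    scores = []
    for i in range(1, len(tokens)):
        # positions j < i holding the same token; the induction target is j + 1
        targets = [j + 1 for j in range(i) if tokens[j] == tokens[i]]
        if targets:
            scores.append(attn[i, targets].sum())
    return float(np.mean(scores)) if scores else 0.0
```

A perfect induction head, which always attends to the token following the earlier copy of the current token, would score close to 1 on a repeated sequence.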
End-to-end speech recognition models trained with a joint Connectionist Temporal Classification (CTC)-attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption, which prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli and in-house medical datasets, showing a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
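As a rough illustration of the decoding-time boosting idea (a simplified sketch, not the paper's actual implementation), a biasing list can be expanded into a set of word prefixes, and candidate subwords that keep a hypothesis on a path toward a listed word receive a log-score bonus. All names and the bonus value are illustrative assumptions.

```python
def boosted_scores(prefix: str, candidates: dict, bias_words: set,
                   bonus: float = 2.0) -> dict:
    """Add a log-score bonus to candidate subwords that keep the running
    hypothesis on a path toward a word in the biasing list.

    prefix:     hypothesis decoded so far (space-separated words)
    candidates: maps candidate subword -> base log-score
    bias_words: rare/OOV words to boost
    """
    # every proper prefix of every biasing word is a "still on track" state
    prefixes = {w[:k] for w in bias_words for k in range(1, len(w) + 1)}
    out = {}
    for sub, score in candidates.items():
        extended = prefix + sub
        current_word = extended.split()[-1] if extended.strip() else ""
        out[sub] = score + (bonus if current_word in prefixes else 0.0)
    return out
```

In a beam search, such a bonus nudges the decoder toward completing rare words it would otherwise prune early.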
Automatic speech recognition (ASR) systems have found numerous industrial applications in very diverse domains. Since domain-specific systems perform better on in-domain evaluation than their domain-agnostic counterparts, the need for memory- and compute-efficient domain adaptation is evident. In particular, adapting the parameter-heavy transformer-based language models used for rescoring ASR hypotheses is challenging. In this work, we introduce domain-prompts, a method that trains a small number of domain token embedding parameters to prime a transformer-based LM toward a particular domain. With just a handful of extra parameters per domain, we achieve a 7-14% improvement over the baseline of using an unadapted LM. Despite being parameter-efficient, these improvements are comparable to those of fully fine-tuned models with hundreds of millions of parameters. With ablations on prompts, dataset sizes, initializations, and domains, we provide evidence for the benefits of using domain-prompts in ASR systems.
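The core mechanism is simple enough to sketch: a small matrix of trainable domain-token embeddings is prepended to the frozen input embeddings, so only those few parameters would be updated per domain. Shapes and names below are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def with_domain_prompt(token_embs: np.ndarray, prompt_embs: np.ndarray) -> np.ndarray:
    """Prepend trainable domain-token embeddings to the (frozen) input
    embeddings; in training, only `prompt_embs` would receive gradients."""
    return np.concatenate([prompt_embs, token_embs], axis=0)

# e.g. 10 domain tokens and a 5-token input, both with 16-dim embeddings
prompt = rng.normal(size=(10, 16))   # the only per-domain trainable parameters
tokens = rng.normal(size=(5, 16))    # embeddings from the frozen LM
primed = with_domain_prompt(tokens, prompt)   # shape (15, 16)
```

The parameter efficiency is visible directly: each new domain costs only `10 * 16` values here, versus the full LM's hundreds of millions.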
In this paper, we introduce a novel network that generates semantic, instance, and part segmentation using a shared encoder, and effectively fuses them to achieve panoptic-part segmentation. Unifying these three segmentation problems allows for mutually improved and consistent representation learning. To fuse the predictions of all three heads efficiently, we introduce a parameter-free joint fusion module that dynamically balances the logits and fuses them to create panoptic-part segmentation. Our method is evaluated on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets. On CPP, the PartPQ of our proposed model with joint fusion surpasses the previous state-of-the-art by 1.6 and 4.7 percentage points for all areas and for segments with parts, respectively. On PPP, our joint fusion outperforms a model using the previous top-down merging strategy by 3.3 percentage points in overall PartPQ and 10.5 percentage points in PartPQ for classes with parts.
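One way to picture parameter-free logit fusion (a simplified sketch, not necessarily the paper's exact balancing) is to normalize each head's logits into probabilities, so neither head dominates by raw scale, before combining them:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_fuse(logits_a: np.ndarray, logits_b: np.ndarray) -> np.ndarray:
    """Parameter-free fusion sketch: each head's logits are converted to a
    probability distribution (removing scale differences between heads),
    then averaged into one distribution per pixel. No learned weights."""
    return 0.5 * (softmax(logits_a) + softmax(logits_b))
```

Because the balancing comes from normalization rather than learned weights, the module adds no parameters, which is the property the abstract highlights.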
Learning from Demonstration (LfD) approaches enable end-users to teach robots new tasks by demonstrating the desired behavior, democratizing access to robotics. However, current LfD frameworks cannot adapt quickly to heterogeneous human demonstrations, nor be deployed at scale in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, enabling quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed over lifelong deployment, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows sublinearly with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks across three continuous-control tasks with an average 57% improvement in policy returns, and requires 78% fewer episodes to model demonstrations using policy mixtures. Finally, we demonstrate the success of FLAIR on a real-robot table tennis task.
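The policy-mixture idea can be sketched as a small fitting problem: given the actions each prototype policy would take on the demonstration's states, find non-negative normalized weights whose mixture best reproduces the demonstrated actions. This least-squares sketch stands in for FLAIR's actual inverse-RL objective; names and shapes are illustrative.

```python
import numpy as np

def fit_mixture_weights(proto_actions: np.ndarray, demo_actions: np.ndarray) -> np.ndarray:
    """Fit non-negative, normalized weights over prototype policies so the
    mixture best explains a new demonstration.

    proto_actions: (K, T, A) actions each of K prototypes takes on the demo states
    demo_actions:  (T, A)    demonstrated actions
    """
    K = proto_actions.shape[0]
    X = proto_actions.reshape(K, -1).T      # (T*A, K): one column per prototype
    y = demo_actions.reshape(-1)            # (T*A,)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = np.clip(w, 0.0, None)               # crude projection to non-negative weights
    return w / w.sum() if w.sum() > 0 else np.full(K, 1.0 / K)
```

A demonstration matching one prototype exactly recovers a one-hot weight vector, which is how a concise prototype set can still cover many user-specific behaviors.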
The capacity for rapid domain adaptation is important for increasing the applicability of reinforcement learning (RL) to real-world problems. Generalization of RL agents is critical to success in the real world, yet zero-shot policy transfer is a challenging problem, since even minor visual changes can make a trained agent fail completely in a new task. We propose USRA: Unified State Representation Learning under Data Augmentation, a representation learning framework that learns a latent unified state representation by applying data augmentations to the agent's observations, improving its ability to generalize to unseen target domains. We showcase the success of our approach on the DeepMind Control Generalization Benchmark for the Walker environment, and find that USRA achieves higher sample efficiency and 14.3% better domain adaptation performance compared to the best baseline results.
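The augmentation-invariance idea can be sketched with a standard pad-and-crop shift augmentation and a consistency loss between the representations of two augmented views. This illustrates the principle only; it is not USRA's actual loss, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shift(obs: np.ndarray, pad: int = 2) -> np.ndarray:
    """Pad-and-crop shift: a common image augmentation in pixel-based RL."""
    h, w = obs.shape
    padded = np.pad(obs, pad, mode="edge")
    y, x = rng.integers(0, 2 * pad + 1, size=2)
    return padded[y:y + h, x:x + w]

def consistency_loss(encode, obs: np.ndarray) -> float:
    """Mean squared distance between the representations of two augmented
    views of the same observation; driving this toward zero encourages a
    unified, augmentation-invariant state representation."""
    z1 = encode(random_shift(obs))
    z2 = encode(random_shift(obs))
    return float(np.mean((z1 - z2) ** 2))
```

An encoder that already ignores the augmentation incurs zero loss, so minimizing this term pushes representations of visually shifted domains together.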
Recent methods for image semantic segmentation involve computationally intensive neural network architectures. Most of these methods are not adaptable to high-resolution image segmentation due to memory and other computational issues. Typical approaches in the literature involve the design of neural network architectures that can fuse global information from low-resolution images with local information from their high-resolution counterparts. However, architectures designed to process high-resolution images are unnecessarily complex and involve many hyperparameters that can be difficult to tune. Moreover, most of these architectures require ground-truth annotations of high-resolution images for training, which are hard to obtain. In this paper, we develop a robust pipeline based on mathematical morphology (MM) operators that can seamlessly extend any existing semantic segmentation algorithm to high-resolution images. Our method does not require ground-truth annotations of high-resolution images. It is based on efficiently exploiting information from the low-resolution counterpart together with gradient information from the high-resolution image. We obtain high-quality seeds from the labels inferred on the low-resolution image using traditional morphological operators, and propagate the seed labels with a random walker to refine the semantic labels at object boundaries. We show that the semantic segmentation results obtained by our method outperform existing state-of-the-art algorithms on high-resolution images. We empirically demonstrate the robustness of our approach to the hyperparameters used in the pipeline. Further, we characterize some necessary conditions under which our pipeline is applicable and provide an in-depth analysis of the proposed approach.
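The seed-and-propagate pipeline can be sketched with standard tools: erode each class region of the upsampled low-resolution prediction so only confident interior pixels remain as seeds, then propagate labels outward to the unseeded pixels. Below, a nearest-seed assignment stands in for the paper's gradient-aware random walker; it is an illustrative simplification.

```python
import numpy as np
from scipy import ndimage

def seeds_from_coarse_labels(labels: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Erode each class region of an upsampled coarse prediction so that
    only confident interior pixels remain as seeds. Class ids are assumed
    to start at 1; 0 is reserved for 'no seed'."""
    seeds = np.zeros_like(labels)
    for c in np.unique(labels):
        mask = ndimage.binary_erosion(labels == c, iterations=iterations)
        seeds[mask] = c
    return seeds

def propagate(seeds: np.ndarray) -> np.ndarray:
    """Assign every unseeded pixel the label of its nearest seed: a crude
    stand-in for the boundary-refining random-walker step in the paper."""
    _, (iy, ix) = ndimage.distance_transform_edt(seeds == 0, return_indices=True)
    return seeds[iy, ix]
```

In the full pipeline, the propagation step would also use the high-resolution image's gradients so that labels stop at true object boundaries rather than equidistant lines.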